Abstract

We derive a novel norm that corresponds to the tightest convex relaxation
of sparsity combined with an $\ell_2$ penalty. We show that this new
$k$-support norm provides a tighter relaxation than the elastic net and is
thus a good replacement for the Lasso or the elastic net in sparse prediction
problems. Through the study of the $k$-support norm, we also bound
the looseness of the elastic net, thus shedding new light on it and providing
justification for its use.



1 Introduction 

Regularizing with the $\ell_1$ norm, when we expect a sparse solution to a regression
problem, is often justified by $\|w\|_1$ being the "convex envelope" of $\|w\|_0$ (the
number of non-zero coordinates of a vector $w \in \mathbb{R}^d$). That is, $\|w\|_1$ is the tightest
convex lower bound on $\|w\|_0$. But we must be careful with this statement:
for sparse vectors with large entries, $\|w\|_0$ can be small while $\|w\|_1$ is large.
In order to discuss convex lower bounds on $\|w\|_0$, we must impose some scale
constraint. A more accurate statement is that $\|w\|_1 \le \|w\|_\infty \|w\|_0$, and so, when
the magnitudes of entries in $w$ are bounded by 1, then $\|w\|_1 \le \|w\|_0$, and indeed
it is the largest such convex lower bound. Viewed as a convex outer relaxation,

$$S_k^{(\infty)} := \{ w \mid \|w\|_0 \le k,\ \|w\|_\infty \le 1 \} \subseteq \{ w \mid \|w\|_1 \le k \} .$$

Intersecting the right-hand side with the $\ell_\infty$ unit ball, we get the tightest convex
outer bound (convex hull) of $S_k^{(\infty)}$:

$$\{ w \mid \|w\|_1 \le k,\ \|w\|_\infty \le 1 \} = \operatorname{conv}(S_k^{(\infty)}) .$$

However, in our view, this relationship between $\|w\|_1$ and $\|w\|_0$ yields
disappointing learning guarantees, and does not appropriately capture the success of
the $\ell_1$ norm as a surrogate for sparsity. In particular, the sample complexity$^1$ of
learning a linear predictor with $k$ non-zero entries by empirical risk minimization
inside this class (an NP-hard optimization problem) scales as$^2$ $O(k \log d)$,
but relaxing to the constraint $\|w\|_1 \le k$ yields a sample complexity which scales
as $O(k^2 \log d)$, because the sample complexity of $\ell_1$-regularized learning scales
quadratically with the $\ell_1$ norm [TTJ QI5].

Perhaps a better reason for the $\ell_1$ norm being a good surrogate for sparsity
is that, not only do we expect the magnitude of each entry of $w$ to be bounded,
but we further expect $\|w\|_2$ to be small. In a regression setting, with a vector
of features $x$, this can be justified when $E[(x^\top w)^2]$ is bounded (a reasonable
assumption) and the features are not too correlated; see, e.g. [TB]. More broadly,
especially in the presence of correlations, we might require this as a modeling
assumption to aid in robustness and generalization. In any case, we have
$\|w\|_1 \le \|w\|_2 \sqrt{\|w\|_0}$, and so if we are interested in predictors with bounded $\ell_2$
norm, we can motivate the $\ell_1$ norm through the following relaxation of sparsity,
where the scale is now set by the $\ell_2$ norm:

$$\{ w \mid \|w\|_0 \le k,\ \|w\|_2 \le B \} \subseteq \{ w \mid \|w\|_1 \le B\sqrt{k} \} .$$
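The inequality $\|w\|_1 \le \|w\|_2 \sqrt{\|w\|_0}$ behind this relaxation is an instance of Cauchy-Schwarz restricted to the support of $w$. The following Python sketch checks it numerically on random sparse vectors. It assumes NumPy; the helper name `l1_bound_gap` and the chosen dimensions are illustrative and not part of the paper.

```python
import numpy as np

def l1_bound_gap(w):
    """Return sqrt(||w||_0) * ||w||_2 - ||w||_1, which should be non-negative."""
    w = np.asarray(w, dtype=float)
    nnz = np.count_nonzero(w)
    return np.sqrt(nnz) * np.linalg.norm(w, 2) - np.linalg.norm(w, 1)

rng = np.random.default_rng(0)
d, k = 50, 5  # illustrative dimension and sparsity level
for _ in range(1000):
    w = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    w[support] = rng.normal(size=k)
    # Allow a tiny negative slack for floating-point rounding.
    assert l1_bound_gap(w) >= -1e-12

# The bound is tight when all non-zero entries share the same magnitude.
flat = np.array([0.5, -0.5, 0.0, 0.5])
print(l1_bound_gap(flat))
```

The last line illustrates why the relaxation cannot be tightened using the $\ell_1$ norm alone: vectors with equal-magnitude non-zero entries sit on the boundary of the bound.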

The sample complexity when using the relaxation now scales as $O(k \log d)$.$^3$

$^1$ We define this as the number of observations needed in order to ensure expected prediction
error no more than $\epsilon$ worse than that of the best $k$-sparse predictor, for an arbitrary constant
$\epsilon$ (that is, we suppress the dependence on $\epsilon$ and focus on the dependence on the sparsity $k$
and dimensionality $d$).

$^2$ This is based on bounding the VC-subgraph dimension of this class, which is essentially
the effective number of parameters.

$^3$ More precisely, the sample complexity is $O(B^2 k \log d)$, where the dependence on $B^2$ is
to be expected. Note that if feature vectors are $\ell_\infty$-bounded (i.e. individual features are
bounded), the sample complexity when using only $\|w\|_2 \le B$ (without a sparsity or $\ell_1$
constraint) scales as $O(B^2 d)$. That is, even after identifying the correct support, we still need a
sample complexity that scales with $B^2$.

Sparse + $\ell_2$ constraint. Our starting point is then that of combining sparsity
and $\ell_2$ regularization, and learning a sparse predictor with small $\ell_2$ norm. We
are thus interested in classes of the form

$$S_k^{(2)} := \{ w \mid \|w\|_0 \le k,\ \|w\|_2 \le 1 \} .$$

As discussed above, the class $\{ \|w\|_1 \le \sqrt{k} \}$ (corresponding to the standard
Lasso) provides a convex relaxation of $S_k^{(2)}$. But it is clear that we can get a
tighter convex relaxation by keeping the $\ell_2$ constraint as well:

$$S_k^{(2)} \subseteq \{ w \mid \|w\|_1 \le \sqrt{k},\ \|w\|_2 \le 1 \} \subseteq \{ w \mid \|w\|_1 \le \sqrt{k} \} . \tag{1}$$

Constraining (or equivalently, penalizing) both the $\ell_1$ and $\ell_2$ norms, as in (1),
is known as the "elastic net" [5, 20] and has indeed been advocated as a better
alternative to the Lasso. In this paper, we ask whether the elastic net is the
tightest convex relaxation to sparsity plus $\ell_2$ (that is, to $S_k^{(2)}$) or whether a
tighter, and better, convex relaxation is possible.

A new norm. We consider the convex hull (tightest convex outer bound) of $S_k^{(2)}$,

$$C_k := \operatorname{conv}(S_k^{(2)}) = \operatorname{conv}\{ w \mid \|w\|_0 \le k,\ \|w\|_2 \le 1 \} . \tag{2}$$

We study the gauge function associated with this convex set, that is, the norm
whose unit ball is given by (2), which we call the $k$-support norm. We show
that, for $k > 1$, this is indeed a tighter convex relaxation than the elastic net
(that is, both inclusions in (1) are in fact strict), and is therefore a
better convex constraint than the elastic net when seeking a sparse, low $\ell_2$-norm
linear predictor. We thus advocate using it as a replacement for the elastic net.

However, we also show that the gap between the elastic net and the $k$-support
norm is at most a factor of $\sqrt{2}$, corresponding to a factor of two difference in
the sample complexity. Thus, our work can also be interpreted as justifying the
use of the elastic net, viewing it as a fairly good approximation to the tightest
possible convex relaxation of sparsity intersected with an $\ell_2$ constraint. Still,
even a factor of two should not necessarily be ignored and, as we show in our
experiments, using the tighter $k$-support norm can indeed be beneficial.

To better understand the $k$-support norm, we show in Section 2 that it can
also be described as the group lasso with overlaps norm [10] corresponding to
all $\binom{d}{k}$ subsets of $k$ features. Despite the exponential number of groups in this
description, we show that the $k$-support norm can be calculated efficiently in
time $O(d \log d)$ and that its dual is given simply by the $\ell_2$ norm of the $k$ largest
entries. We also provide efficient first-order optimization algorithms for learning
with the $k$-support norm.

Related Work. In many learning problems of interest, Lasso has been observed
to shrink too many of the variables of $w$ to zero. In particular, in many
applications, when a group of variables is highly correlated, the Lasso may prefer
a sparse solution, but we might gain more predictive accuracy by including
all the correlated variables in our model. These drawbacks have recently motivated
the use of various other regularization methods, such as the elastic net
[20], which penalizes the regression coefficients $w$ with a combination of $\ell_1$ and
$\ell_2$ norms:

$$\min \Big\{ \tfrac1n \|Xw - y\|^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2 \ :\ w \in \mathbb{R}^d \Big\} , \tag{3}$$

where for a sample of size $n$, $y \in \mathbb{R}^n$ is the vector of response values, and
$X \in \mathbb{R}^{n \times d}$ is a matrix with column $j$ containing the values of feature $j$.

The elastic net can be viewed as a trade-off between $\ell_1$ regularization (the
Lasso) and $\ell_2$ regularization (Ridge regression [§]), depending on the relative
values of $\lambda_1$ and $\lambda_2$. In particular, when $\lambda_2 = 0$, (3) is equivalent to the Lasso.
This method, and the other methods discussed below, have been observed to 
significantly outperform Lasso in many real applications. 

The pairwise elastic net (PEN), proposed by [13], has a penalty function
that accounts for similarity among features:

$$\|w\|_R^{PEN} = \|w\|_2^2 + \|w\|_1^2 - |w|^\top R\, |w| ,$$

where $R \in [0, 1]^{p \times p}$ is a matrix with $R_{jk}$ measuring similarity between
features $X_j$ and $X_k$. The trace Lasso [6] is a second method proposed to handle
correlations within $X$, defined by

$$\|w\|_X^{trace} = \|X \operatorname{diag}(w)\|_* ,$$

where $\|\cdot\|_*$ denotes the matrix trace-norm (the sum of the singular values)
and promotes a low-rank solution. If the features are orthogonal, then both
the PEN and the Trace Lasso are equivalent to the Lasso. If the features are
all identical, then both penalties are equivalent to Ridge regression (penalizing
$\|w\|_2$). Another existing penalty is OSCAR [3], given by

$$\|w\|_c^{OSCAR} = \|w\|_1 + c \sum_{j<k} \max\{ |w_j|, |w_k| \} .$$

Like the elastic net, each one of these three methods also "prefers" averaging 
similar features over selecting a single feature. 

2 The k-Support Norm

One argument for the elastic net has been the flexibility of tuning the cardinality
$k$ of the regression vector $w$. Thus, when groups of correlated variables are
present, a larger $k$ may be learned, which corresponds to a higher $\lambda_2$ in (3). A
more natural way to obtain such an effect of tuning the cardinality is to consider
the convex hull of cardinality $k$ vectors,

$$C_k = \operatorname{conv}(S_k^{(2)}) = \operatorname{conv}\{ w \in \mathbb{R}^d \mid \|w\|_0 \le k,\ \|w\|_2 \le 1 \} .$$

Clearly the sets $C_k$ are nested, and $C_1$ and $C_d$ are the unit balls for the $\ell_1$ and
$\ell_2$ norms, respectively. Consequently we define the $k$-support norm as the norm
whose unit ball equals $C_k$ (the gauge function associated with the $C_k$ ball).$^4$ An
equivalent definition is the following variational formula:

$^4$ The gauge function $\gamma_{C_k} : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is defined as $\gamma_{C_k}(x) = \inf\{ \lambda \in \mathbb{R}_+ : x \in \lambda C_k \}$.

Figure 1: Unit ball of the 2-support norm (left) and of the elastic net (right) on $\mathbb{R}^3$.



Definition 2.1. Let $k \in \{1, \dots, d\}$. The $k$-support norm $\|\cdot\|_k^{sp}$ is defined, for
every $w \in \mathbb{R}^d$, as

$$\|w\|_k^{sp} := \min\Big\{ \sum_{I \in G_k} \|v_I\|_2 \ :\ \operatorname{supp}(v_I) \subseteq I,\ \sum_{I \in G_k} v_I = w \Big\} ,$$

where $G_k$ denotes the set of all subsets of $\{1, \dots, d\}$ of cardinality at most $k$.

The equivalence is immediate by rewriting $v_I = \mu_I z_I$ in the above definition,
where $\mu_I \ge 0$, $z_I \in C_k$, $\forall I \in G_k$, $\sum_{I \in G_k} \mu_I = 1$. In addition, this immediately
implies that $\|\cdot\|_k^{sp}$ is indeed a norm. In fact, the $k$-support norm is equivalent to
the norm used by the group lasso with overlaps [10], when the set of overlapping
groups is chosen to be $G_k$ (however, the group lasso has traditionally been
used for applications with some specific known group structure, unlike the case
considered here).

Although the variational definition 2.1 is not amenable to computation because
of the exponential growth of the set of groups $G_k$, the $k$-support norm
is computationally very tractable, with an $O(d \log d)$ algorithm described in
Section 2.2.

As already mentioned, $\|\cdot\|_1^{sp} = \|\cdot\|_1$ and $\|\cdot\|_d^{sp} = \|\cdot\|_2$. The unit ball of this
new norm in $\mathbb{R}^3$ for $k = 2$ is depicted in Figure 1. We immediately notice several
differences between this unit ball and the elastic net unit ball. For example, at
points with cardinality $k$ and $\ell_2$ norm equal to 1, the $k$-support norm is not
differentiable, but unlike the $\ell_1$ or elastic-net norm, it is differentiable at points
with cardinality less than $k$. Thus, the $k$-support norm is less "biased" towards
sparse vectors than the elastic net and the $\ell_1$ norm.



2.1 The Dual Norm



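It is interesting and useful to compute the dual of the $k$-support norm. We
follow the notation of [5] for ordered vectors: for any $w \in \mathbb{R}^d$, $|w|$ is the vector
of absolute values, and $|w|_i^\downarrow$ is the $i$-th largest element of $|w|$. We have

$$\|u\|_k^{sp*} = \max\big\{ \langle w, u \rangle : \|w\|_k^{sp} \le 1 \big\}
= \max\Big\{ \Big( \sum_{i \in I} u_i^2 \Big)^{1/2} : I \in G_k \Big\}
= \Big( \sum_{i=1}^{k} (|u|_i^\downarrow)^2 \Big)^{1/2} =: \|u\|_{(k)}^{(2)} .$$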



This is the $\ell_2$-norm of the largest $k$ entries in $u$, and is known as the 2-$k$
symmetric gauge norm [2].

Not surprisingly, this dual norm interpolates between the $\ell_2$ norm (when
$k = d$ and all entries are taken) and the $\ell_\infty$ norm (when $k = 1$ and only the
largest entry is taken). This parallels the interpolation of the $k$-support norm
between the $\ell_1$ and $\ell_2$ norms.

Like the $\ell_p$ norms and elastic net, the $k$-support norm and its dual are
symmetric gauge functions, that is, sign- and permutation-invariant norms. For
properties of such norms, see [2].
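Because the dual norm is just the $\ell_2$ norm of the $k$ largest-magnitude entries, it is straightforward to compute. The sketch below is one possible implementation, assuming NumPy; the function name `ksupport_dual_norm` is illustrative. It uses `np.partition` to select the top $k$ entries without a full sort.

```python
import numpy as np

def ksupport_dual_norm(u, k):
    """l2 norm of the k largest-magnitude entries of u (dual of the k-support norm)."""
    a = np.abs(np.asarray(u, dtype=float))
    d = a.size
    if not 1 <= k <= d:
        raise ValueError("k must lie in {1, ..., d}")
    # After partitioning at index d - k, the last k slots hold the k largest values.
    top_k = np.partition(a, d - k)[d - k:]
    return float(np.sqrt(np.sum(top_k ** 2)))

u = np.array([3.0, -4.0, 1.0, 0.0])
print(ksupport_dual_norm(u, 1))  # k = 1: largest absolute entry (l_infinity norm)
print(ksupport_dual_norm(u, 4))  # k = d: Euclidean norm
```

The two printed values correspond to the endpoints of the interpolation described above; intermediate $k$ gives values in between.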

2.2 Computation of the Norm 

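In this section, we derive an alternative formula for the $k$-support norm, which
leads to computation of the value of the norm in $O(d \log d)$ steps.

Proposition 2.1. For every $w \in \mathbb{R}^d$,

$$\|w\|_k^{sp} = \left( \sum_{i=1}^{k-r-1} (|w|_i^\downarrow)^2 + \frac{1}{r+1} \Big( \sum_{i=k-r}^{d} |w|_i^\downarrow \Big)^2 \right)^{1/2} ,$$

where, letting $|w|_0^\downarrow$ denote $+\infty$, $r$ is the unique integer in $\{0, \dots, k-1\}$ satisfying

$$|w|_{k-r-1}^\downarrow > \frac{1}{r+1} \sum_{i=k-r}^{d} |w|_i^\downarrow \ge |w|_{k-r}^\downarrow . \tag{4}$$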



This result shows that $\|\cdot\|_k^{sp}$ trades off between the $\ell_1$ and $\ell_2$ norms in a way
that favors sparse vectors but allows for cardinality larger than $k$. It combines
the uniform shrinkage of an $\ell_2$ penalty for the largest components, with the
sparse shrinkage of an $\ell_1$ penalty for the smallest components.
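The formula of Proposition 2.1 can be turned into a short routine. The following is a minimal sketch, assuming NumPy; the name `ksupport_norm` is illustrative. Rather than testing both sides of condition (4), it returns at the first $r$ for which the left inequality holds: if stage $r-1$ failed the left inequality, then the average at stage $r$ is automatically at least $|w|_{k-r}^\downarrow$, so the right inequality comes for free. This choice also avoids failures caused by floating-point ties.

```python
import numpy as np

def ksupport_norm(w, k):
    """k-support norm via the closed form of Proposition 2.1 (sort + O(k) scan)."""
    z = np.sort(np.abs(np.asarray(w, dtype=float)))[::-1]  # |w| in decreasing order
    d = z.size
    if not 1 <= k <= d:
        raise ValueError("k must lie in {1, ..., d}")
    # tail[j] = z[j] + ... + z[d-1], with tail[d] = 0
    tail = np.concatenate((np.cumsum(z[::-1])[::-1], [0.0]))
    for r in range(k):
        start = k - r - 1                # 0-based position of the 1-based index k - r
        avg = tail[start] / (r + 1)      # (1/(r+1)) * sum_{i >= k-r} |w|_i
        upper = np.inf if start == 0 else z[start - 1]  # |w|_{k-r-1}, +inf if k-r-1 = 0
        if upper > avg:                  # left inequality of (4)
            head = np.sum(z[:start] ** 2)
            return float(np.sqrt(head + tail[start] ** 2 / (r + 1)))
    raise AssertionError("unreachable: r = k - 1 always satisfies the test")

w = np.array([3.0, -4.0, 0.0])
print(ksupport_norm(w, 1))  # equals the l1 norm of w
print(ksupport_norm(w, 3))  # equals the l2 norm of w
```

The cost is dominated by the sort, in line with the $O(d \log d)$ bound stated above. The two printed cases correspond to the endpoints $k = 1$ and $k = d$.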

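Proof of Proposition 2.1. We will use the inequality $\langle w, u \rangle \le \langle w^\downarrow, u^\downarrow \rangle$ [7].
We have

$$\begin{aligned}
\tfrac12 (\|w\|_k^{sp})^2
&= \max\Big\{ \langle u, w \rangle - \tfrac12 (\|u\|_{(k)}^{(2)})^2 : u \in \mathbb{R}^d \Big\} \\
&= \max\Big\{ \sum_{i=1}^{d} \alpha_i |w|_i^\downarrow - \tfrac12 \sum_{i=1}^{k} \alpha_i^2 : \alpha_1 \ge \dots \ge \alpha_d \ge 0 \Big\} \\
&= \max\Big\{ \sum_{i=1}^{k-1} \alpha_i |w|_i^\downarrow + \alpha_k \sum_{i=k}^{d} |w|_i^\downarrow - \tfrac12 \sum_{i=1}^{k} \alpha_i^2 : \alpha_1 \ge \dots \ge \alpha_k \ge 0 \Big\} .
\end{aligned}$$

Let $A_r := \sum_{i=k-r}^{d} |w|_i^\downarrow$ for $r \in \{0, \dots, k-1\}$. If $A_0 < |w|_{k-1}^\downarrow$ then the solution
$\alpha$ is given by $\alpha_i = |w|_i^\downarrow$ for $i = 1, \dots, (k-1)$, $\alpha_i = A_0$ for $i = k, \dots, d$. If
$A_0 \ge |w|_{k-1}^\downarrow$ then the optimal $\alpha_k$, $\alpha_{k-1}$ lie between $|w|_{k-1}^\downarrow$ and $A_0$, and have to
be equal. So, the maximization becomes

$$\max\Big\{ \sum_{i=1}^{k-2} \alpha_i |w|_i^\downarrow - \tfrac12 \sum_{i=1}^{k-2} \alpha_i^2 + A_1 \alpha_{k-1} - \alpha_{k-1}^2 : \alpha_1 \ge \dots \ge \alpha_{k-1} \ge 0 \Big\} .$$

If $A_0 \ge |w|_{k-1}^\downarrow$ and $|w|_{k-2}^\downarrow > \frac{A_1}{2}$ then the solution is $\alpha_i = |w|_i^\downarrow$ for $i = 1, \dots, (k-2)$,
$\alpha_i = \frac{A_1}{2}$ for $i = (k-1), \dots, d$. Otherwise we proceed as before and continue
this process. At stage $r$ the process terminates if $A_0 \ge |w|_{k-1}^\downarrow, \dots, \frac{A_{r-1}}{r} \ge |w|_{k-r}^\downarrow$,
$\frac{A_r}{r+1} < |w|_{k-r-1}^\downarrow$, and all but the last two inequalities are redundant.
Hence the condition can be rewritten as (4). One optimal solution is $\alpha_i = |w|_i^\downarrow$
for $i = 1, \dots, k-r-1$, $\alpha_i = \frac{A_r}{r+1}$ for $i = k-r, \dots, d$. This proves the claim. ■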

2.3 Learning with the k-support norm

We thus propose using learning rules with $k$-support norm regularization. These
are appropriate when we would like to learn a sparse predictor that also has low
$\ell_2$ norm, and are especially relevant when features might be correlated (that
is, in almost all learning tasks) but the correlation structure is not known in
advance. For regression problems with squared error loss, the resulting learning
rule is of the form

$$\min \Big\{ \tfrac12 \|Xw - y\|^2 + \tfrac{\lambda}{2} (\|w\|_k^{sp})^2 \ :\ w \in \mathbb{R}^d \Big\} \tag{5}$$

with $\lambda > 0$ a regularization parameter and $k \in \{1, \dots, d\}$ also a parameter
to be tuned. As typical in regularization-based methods, both $\lambda$ and $k$ can
be selected by cross validation [5]. Although we have motivated this norm by
considering $S_k^{(2)}$, the set of $k$-sparse unit vectors, the parameter $k$ does not
necessarily correspond to the sparsity level of the fitted vector of coefficients,
and should be chosen via cross-validation independently of the desired sparsity
level.

3 Relation to the Elastic Net



Recall that the elastic net with penalty parameters $\lambda_1$ and $\lambda_2$ selects a vector
of coefficients given by

$$\arg\min \Big\{ \tfrac1n \|Xw - y\|^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2 \Big\} . \tag{6}$$

For ease of comparison with the $k$-support norm, we first show that the set
of optimal solutions for the elastic net, when the parameters are varied, is the
same as for the norm

$$\|w\|_k^{el} := \max\Big\{ \|w\|_2 ,\ \frac{\|w\|_1}{\sqrt{k}} \Big\} ,$$

when $k \in [1, d]$, corresponding to the unit ball in (1) (note that $k$ is not
necessarily an integer). To see this, let $\hat w$ be a solution to (6), and let
$k := (\|\hat w\|_1 / \|\hat w\|_2)^2 \in [1, d]$. With this choice of $k$, the two terms in the
maximum coincide at $\hat w$, that is, $\|\hat w\|_k^{el} = \|\hat w\|_2 = \|\hat w\|_1 / \sqrt{k}$.
Then for any $w \ne \hat w$, if $\|w\|_k^{el} \le \|\hat w\|_k^{el}$, then $\|w\|_p \le \|\hat w\|_p$ for $p = 1, 2$. Since
$\hat w$ is a solution to (6), therefore, $\|Xw - y\|_2^2 \ge \|X\hat w - y\|_2^2$. This proves that, for
some constraint parameter $B$,

$$\hat w = \arg\min \Big\{ \tfrac1n \|Xw - y\|_2^2 \ :\ \|w\|_k^{el} \le B \Big\} .$$

Like the $k$-support norm, the elastic net interpolates between the $\ell_1$ and $\ell_2$
norms. In fact, when $k$ is an integer, any $k$-sparse unit vector $w \in \mathbb{R}^d$ must lie
in the unit ball of $\|\cdot\|_k^{el}$. Since the $k$-support norm gives the convex hull of all
$k$-sparse unit vectors, this immediately implies that

$$\|w\|_k^{el} \le \|w\|_k^{sp} \quad \forall\, w \in \mathbb{R}^d .$$

The two norms are not equal, however. The difference between the two is
illustrated in Figure 1, where we see that the $k$-support norm is more "rounded".

To see an example where the two norms are not equal, we set $d = 1 + k^2$ for
some large $k$, and let $w = (k^{1.5}, 1, 1, \dots, 1)^\top \in \mathbb{R}^d$. Then

$$\|w\|_k^{el} = \max\Big\{ \sqrt{k^3 + k^2} ,\ \frac{k^{1.5} + k^2}{\sqrt{k}} \Big\} = k^{1.5} \Big( 1 + \frac{1}{\sqrt{k}} \Big) .$$

Taking $u = \big( \frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2k}}, \dots, \frac{1}{\sqrt{2k}} \big)^\top$, we have $\|u\|_{(k)}^{(2)} < 1$, and recalling this
norm is dual to the $k$-support norm:

$$\|w\|_k^{sp} > \langle w, u \rangle = \frac{k^{1.5}}{\sqrt{2}} + k^2 \cdot \frac{1}{\sqrt{2k}} = \sqrt{2} \cdot k^{1.5} .$$

In this example, we see that the two norms can differ by as much as a factor of
$\sqrt{2}$. We now show that this is actually the most by which they can differ.
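The arithmetic of this example can be checked numerically. The sketch below, assuming NumPy, builds the vectors $w$ and $u$ from the example, verifies that $u$ lies strictly inside the dual unit ball, and reports the ratio $\langle w, u \rangle / \|w\|_k^{el}$, which lower-bounds $\|w\|_k^{sp} / \|w\|_k^{el}$. The function names are illustrative.

```python
import numpy as np

def elastic_net_norm(w, k):
    """max{ ||w||_2 , ||w||_1 / sqrt(k) }, the norm compared against in this section."""
    w = np.asarray(w, dtype=float)
    return max(np.linalg.norm(w, 2), np.linalg.norm(w, 1) / np.sqrt(k))

def dual_lower_bound_ratio(k):
    """<w, u> / ||w||_el for the example with d = 1 + k^2."""
    d = 1 + k ** 2
    w = np.ones(d)
    w[0] = k ** 1.5
    u = np.full(d, 1.0 / np.sqrt(2.0 * k))
    u[0] = 1.0 / np.sqrt(2.0)
    # Dual norm of u: l2 norm of its k largest entries; it must be below 1.
    dual = np.sqrt(np.sum(np.sort(np.abs(u))[::-1][:k] ** 2))
    assert dual < 1.0
    return float(w @ u) / elastic_net_norm(w, k)

for k in (4, 100, 900):
    print(k, dual_lower_bound_ratio(k), np.sqrt(2.0) / (1.0 + 1.0 / np.sqrt(k)))
```

The two printed columns should agree, since the ratio equals $\sqrt{2} / (1 + 1/\sqrt{k})$ in closed form, and it approaches $\sqrt{2}$ only as $k$ grows.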

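Proposition 3.1. $\|\cdot\|_k^{el} \le \|\cdot\|_k^{sp} < \sqrt{2}\, \|\cdot\|_k^{el}$.

Proof. We show that these bounds hold in the duals of the two norms. First,
since $\|\cdot\|_k^{el}$ is a maximum over the $\ell_1$ and $\ell_2$ norms, its dual is given by

$$\|u\|_k^{(el)*} := \inf_{a \in \mathbb{R}^d} \big\{ \|a\|_2 + \sqrt{k} \cdot \|u - a\|_\infty \big\} .$$

Now take any $u \in \mathbb{R}^d$. First we show $\|u\|_{(k)}^{(2)} \le \|u\|_k^{(el)*}$. Without loss of
generality, we take $u_1 \ge \dots \ge u_d \ge 0$. For any $a \in \mathbb{R}^d$,

$$\|u\|_{(k)}^{(2)} = \|u_{1:k}\|_2 \le \|a_{1:k}\|_2 + \|u_{1:k} - a_{1:k}\|_2 \le \|a\|_2 + \sqrt{k}\, \|u - a\|_\infty .$$

Finally, we show that $\|u\|_k^{(el)*} < \sqrt{2}\, \|u\|_{(k)}^{(2)}$. Taking $a = (u_1 - u_{k+1}, \dots, u_k - u_{k+1}, 0, \dots, 0)^\top$,
we have

$$\begin{aligned}
\|u\|_k^{(el)*} &\le \|a\|_2 + \sqrt{k} \cdot \|u - a\|_\infty = \sqrt{ \sum_{i=1}^{k} (u_i - u_{k+1})^2 } + \sqrt{k}\, |u_{k+1}| \\
&\le \sqrt{ \sum_{i=1}^{k} (u_i^2 - u_{k+1}^2) } + \sqrt{ k\, u_{k+1}^2 }
\le \sqrt{2} \cdot \sqrt{ \sum_{i=1}^{k} (u_i^2 - u_{k+1}^2) + k\, u_{k+1}^2 } = \sqrt{2}\, \|u\|_{(k)}^{(2)} .
\end{aligned}$$

Furthermore, this yields a strict inequality, because if $u_1 > u_{k+1}$, the next-to-last
inequality is strict, while if $u_1 = \dots = u_{k+1}$, then the last inequality is
strict. ■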



4 Optimization 

Solving the optimization problem (5) efficiently can be done with a first-order
proximal algorithm. Proximal methods (see [TJ 31 [T51 [T71 [TB] and references
therein) are used to solve composite problems of the form $\min\{ f(x) + \omega(x) :
x \in \mathbb{R}^d \}$, where the loss function $f(x)$ and the regularizer $\omega(x)$ are convex
functions, and $f$ is smooth with an $L$-Lipschitz gradient. These methods require
fast computation of the gradient $\nabla f$ and the proximity operator

$$\operatorname{prox}_\omega(x) := \arg\min \Big\{ \tfrac12 \|u - x\|^2 + \omega(u) \ :\ u \in \mathbb{R}^d \Big\} .$$

In particular, accelerated first-order methods, proposed by Nesterov [141 115], require
two levels of memory at each iteration and exhibit an optimal $O\big(\frac{1}{T^2}\big)$
convergence rate for the objective after $T$ iterations.

To obtain a proximal method for $k$-support regularization, it suffices to compute
the proximity map of $g = \frac{1}{2L} (\|\cdot\|_k^{sp})^2$, for any $L > 0$. This can be done in
$O(d(k + \log d))$ steps with Algorithm 1.

Algorithm 1 Computation of the proximity operator.
Input $v \in \mathbb{R}^d$

Output $q = \operatorname{prox}_{\frac{1}{2L} (\|\cdot\|_k^{sp})^2}(v)$

Find $r \in \{0, \dots, k-1\}$, $\ell \in \{k, \dots, d\}$ such that

$$\frac{1}{L+1}\, z_{k-r-1} > \frac{T_{r,\ell}}{\ell - k + (L+1) r + L + 1} \ge \frac{1}{L+1}\, z_{k-r} \tag{7}$$

$$z_\ell > \frac{T_{r,\ell}}{\ell - k + (L+1) r + L + 1} \ge z_{\ell+1} \tag{8}$$

where $z := |v|^\downarrow$, $z_0 := +\infty$, $z_{d+1} := -\infty$, $T_{r,\ell} := \sum_{i=k-r}^{\ell} z_i$

$$q_i \leftarrow \begin{cases}
\frac{L}{L+1}\, z_i & \text{if } i = 1, \dots, k-r-1 \\
z_i - \frac{T_{r,\ell}}{\ell - k + (L+1) r + L + 1} & \text{if } i = k-r, \dots, \ell \\
0 & \text{if } i = \ell+1, \dots, d
\end{cases}$$

Reorder and change signs of $q$ to conform with $v$
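Algorithm 1 can be sketched directly as a search over the pairs $(r, \ell)$. The Python code below is an illustrative transcription assuming NumPy, not the authors' implementation; the function name `prox_ksupport_sq` is hypothetical. It uses exact comparisons, so a robust version would likely need a tolerance for floating-point ties, and it handles $v = 0$ separately because conditions (7) and (8) cannot hold in that case.

```python
import numpy as np

def prox_ksupport_sq(v, k, L):
    """Prox of (1/(2L)) * (k-support norm)^2 at v, by direct search over (r, l)."""
    v = np.asarray(v, dtype=float)
    d = v.size
    if not np.any(v):
        return np.zeros(d)                     # prox of the zero vector is zero
    order = np.argsort(-np.abs(v))             # positions sorting |v| decreasingly
    z = np.abs(v)[order]
    # 1-based padded copy: zp[0] = +inf, zp[1..d] = z, zp[d+1] = -inf
    zp = np.concatenate(([np.inf], z, [-np.inf]))
    csum = np.concatenate(([0.0], np.cumsum(z)))   # csum[i] = z_1 + ... + z_i
    for r in range(k):
        for l in range(k, d + 1):
            T = csum[l] - csum[k - r - 1]          # sum_{i=k-r}^{l} z_i
            t = T / (l - k + (L + 1) * r + L + 1)
            cond7 = zp[k - r - 1] / (L + 1) > t >= zp[k - r] / (L + 1)
            cond8 = zp[l] > t >= zp[l + 1]
            if cond7 and cond8:
                q = np.zeros(d)
                q[: k - r - 1] = L / (L + 1) * z[: k - r - 1]
                q[k - r - 1 : l] = z[k - r - 1 : l] - t
                out = np.zeros(d)
                out[order] = np.sign(v[order]) * q  # undo sorting, restore signs
                return out
    raise RuntimeError("no (r, l) satisfied (7)-(8); likely a floating-point tie")

print(prox_ksupport_sq([3.0, -1.0, 2.0], k=3, L=2))  # k = d: scales v by L / (L + 1)
```

For $k = d$ the $k$-support norm is the $\ell_2$ norm, so the prox reduces to multiplying $v$ by $L/(L+1)$, which gives a convenient check. The double loop costs $O(kd)$ after the sort, consistent with the stated $O(d(k + \log d))$.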



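Algorithm 2 Accelerated $k$-support regularization.

$w_1 = \alpha_1 \in \mathbb{R}^d$, $\theta_1 \leftarrow 1$
for $t = 1, 2, \dots$ do
    $\theta_{t+1} \leftarrow \frac{1 + \sqrt{1 + 4\theta_t^2}}{2}$
    $w_{t+1} \leftarrow \operatorname{prox}_{\frac{\lambda}{2L} (\|\cdot\|_k^{sp})^2} \big( \alpha_t - \frac{1}{L} X^\top (X\alpha_t - y) \big)$ using Algorithm 1
    $\alpha_{t+1} \leftarrow w_{t+1} + \frac{\theta_t - 1}{\theta_{t+1}} (w_{t+1} - w_t)$
end for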
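The loop of Algorithm 2 can be sketched generically, with the proximity operator passed in as a function. The code below assumes NumPy, and the names `fista_least_squares` and `ridge_prox` are illustrative. To keep the example short and checkable, it does not use the $k$-support prox. Instead it plugs in the special case $k = d$, where the $k$-support norm equals the $\ell_2$ norm, problem (5) becomes ridge regression, and the prox is a simple rescaling. A $k$-support prox routine could be supplied in its place.

```python
import numpy as np

def fista_least_squares(X, y, prox, L, n_iter=1000):
    """Accelerated proximal gradient for 0.5 * ||Xw - y||^2 + R(w).

    prox(v) must return the proximity operator of R / L at v, and L should
    upper-bound the largest eigenvalue of X^T X.
    """
    w = np.zeros(X.shape[1])
    a = w.copy()
    theta = 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ a - y)                 # gradient of the squared loss at a
        w_next = prox(a - grad / L)              # gradient step followed by prox step
        theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
        a = w_next + (theta - 1.0) / theta_next * (w_next - w)   # momentum step
        w, theta = w_next, theta_next
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                     # synthetic data, for illustration only
y = rng.normal(size=30)
lam = 1.0
L = np.linalg.norm(X, 2) ** 2                    # largest eigenvalue of X^T X

def ridge_prox(v):
    """Prox of (lam / (2L)) * ||.||_2^2, i.e. the k = d case of the k-support prox."""
    return v * L / (L + lam)

w_hat = fista_least_squares(X, y, ridge_prox, L)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
print(np.linalg.norm(w_hat - w_ridge))           # distance to the closed-form ridge solution
```

Passing the prox as an argument keeps the accelerated loop independent of the regularizer, which mirrors how Algorithm 2 delegates the prox step to Algorithm 1.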



Proof of Correctness of Algorithm 1. Since the support-norm is sign and
permutation invariant, $\operatorname{prox}_g(v)$ has the same ordering and signs as $v$. Hence,
without loss of generality, we may assume that $v_1 \ge \dots \ge v_d \ge 0$ and require
that $q_1 \ge \dots \ge q_d \ge 0$, which follows from inequality (7) and the fact that $z$ is
ordered.

Now, $q = \operatorname{prox}_g(v)$ is equivalent to $Lz - Lq = Lv - Lq \in \partial\, \tfrac12 (\|\cdot\|_k^{sp})^2 (q)$. It
suffices to show that, for $w = q$, $Lz - Lq$ is an optimal $\alpha$ in the proof of
Proposition 2.1. Indeed, $A_r$ corresponds to

$$\sum_{i=k-r}^{d} q_i = \sum_{i=k-r}^{\ell} \Big( z_i - \frac{T_{r,\ell}}{\ell - k + (L+1) r + L + 1} \Big)
= T_{r,\ell} - \frac{(\ell - k + r + 1)\, T_{r,\ell}}{\ell - k + (L+1) r + L + 1}
= (r+1)\, \frac{L\, T_{r,\ell}}{\ell - k + (L+1) r + L + 1}$$

and (4) is equivalent to condition (7). For $i \le k-r-1$, we have $Lz_i - Lq_i = q_i$.
For $k-r \le i \le \ell$, we have $Lz_i - Lq_i = \frac{1}{r+1} A_r$. For $i > \ell$, since $q_i = 0$, we only
need $Lz_i - Lq_i \le \frac{1}{r+1} A_r$, which is true by (8). ■

We can now apply a standard accelerated proximal method, such as FISTA
PP, to (5), at each iteration using the gradient of the loss and performing a prox
step using Algorithm 1. The FISTA guarantee ensures us that, with appropriate
step sizes, after $T$ such iterations, we have:

$$\tfrac12 \|X w_T - y\|^2 + \tfrac{\lambda}{2} (\|w_T\|_k^{sp})^2 \le \Big( \tfrac12 \|X w^* - y\|^2 + \tfrac{\lambda}{2} (\|w^*\|_k^{sp})^2 \Big) + \frac{2L \|w^* - w_1\|^2}{(T+1)^2} .$$



5 Empirical Comparisons 

Our theoretical analysis indicates that the $k$-support norm and the elastic net
differ by at most a factor of $\sqrt{2}$, corresponding to at most a factor of two
difference in their sample complexities and generalization guarantees. We thus
do not expect huge differences between their actual performances, but would
still like to see whether the tighter relaxation of the $k$-support norm does yield
some gains.

Synthetic Data. For the first simulation we follow [20, Sec. 5, example 4]. In
this experimental protocol, the target (oracle) vector equals

$$w^* = ( \underbrace{3, \dots, 3}_{15}, \underbrace{0, \dots, 0}_{25} ) ,$$

with $y = (w^*)^\top x + N(0, 1)$.

The input data $X$ were generated from a normal distribution such that
components $1, \dots, 5$ have the same random mean $Z_1 \sim N(0, 1)$, components
$6, \dots, 10$ have mean $Z_2 \sim N(0, 1)$ and components $11, \dots, 15$ have mean
$Z_3 \sim N(0, 1)$. A total of 50 data sets were created in this way, each containing 50
training points, 50 validation points and 350 test points. The goal is to achieve
good prediction performance on the test data.

We compared the $k$-support norm with Lasso and the elastic net. We considered
the ranges $k = \{1, \dots, d\}$ for $k$-support norm regularization, $\lambda = 10^i$,
$i = \{-15, \dots, 5\}$, for the regularization parameter of Lasso and $k$-support
regularization and the same range for the $\lambda_1$, $\lambda_2$ of the elastic net. For each method,
the optimal set of parameters was selected based on mean squared error on the
validation set. The error reported in Table 1 is the mean squared error with
respect to the oracle $w^*$, namely $MSE = (\hat w - w^*)^\top V (\hat w - w^*)$, where $V$ is the
population covariance matrix of $X_{test}$.
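The oracle error above is a quadratic form and can be computed in a couple of lines. The sketch below assumes NumPy; the function name `oracle_mse` is illustrative, and the identity matrix used for $V$ is only a placeholder, since the actual population covariance depends on details of the data-generating protocol that are not reproduced here.

```python
import numpy as np

def oracle_mse(w_hat, w_star, V):
    """(w_hat - w_star)^T V (w_hat - w_star) for a given covariance matrix V."""
    diff = np.asarray(w_hat, dtype=float) - np.asarray(w_star, dtype=float)
    return float(diff @ np.asarray(V, dtype=float) @ diff)

# Oracle vector from the protocol: 15 entries equal to 3, then 25 zeros.
w_star = np.concatenate((np.full(15, 3.0), np.zeros(25)))

# Placeholder covariance (identity); substitute the true population covariance.
V = np.eye(w_star.size)

w_hat = w_star.copy()
w_hat[0] += 0.5            # a hypothetical estimate that is off in one coordinate
print(oracle_mse(w_hat, w_star, V))
```

With a non-identity $V$, errors along strongly correlated feature directions are weighted differently from errors along independent ones, which is why the measure uses the population covariance rather than the plain squared distance.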

Beyond the predictive gains, to further illustrate the effect of the $k$-support
norm, in Figure 2 we show the coefficients learned by each method, in absolute
value. For each image, one row corresponds to the $w$ learned for one of the
50 data sets. Whereas the elastic net can learn higher values at the relevant
features, a better feature pattern with less variability emerges when using the
$k$-support norm.

South African Heart Data. This is a classification task which has been
used in 0. There are 9 variables and 462 examples, and the response is
presence/absence of coronary heart disease. We normalized the data so that each
predictor variable has zero mean and unit variance. We then split the data 50
times randomly into training, validation, and test sets of sizes 400, 30, and 32
respectively. For each method, parameters were selected using the validation
data. In Table 1 we report the MSE and accuracy of each method on the test
data. We observe that all three methods have identical performance.

Figure 2: Solutions learned by each method for all the simulation data sets. Left to
right: $k$-support, Lasso and elastic net.

Table 1: Mean squared errors and classification accuracy for the synthetic data (median
over 50 repetitions), SA heart data (median over 50 replications) and for the "20
newsgroups" data set. (SE = standard error)

              Synthetic       Heart                           Newsgroups
Method        MSE (SE)        MSE (SE)       Accuracy (SE)    MSE     Accuracy
Lasso         0.2746 (0.02)   0.18 (0.005)   66.41 (0.53)     0.70    73.02
Elastic net   0.3119 (0.03)   0.18 (0.005)   66.41 (0.53)     0.71    72.53
k-support     0.2342 (0.02)   0.18 (0.005)   66.41 (0.53)     0.69    73.40

20 Newsgroups. This is a binary classification version of 20 newsgroups created
in [12] which can be found in the LIBSVM data repository.$^5$ The positive
class consists of the 10 groups with names of form sci.*, comp.*, or misc.forsale
and the negative class consists of the other 10 groups. To reduce the number
of features, we removed the words which appear in less than 3 documents.
We randomly split the data into a training, a validation and a test set of sizes
14000, 1000 and 4996, respectively. We report MSE and accuracy on the test
data in Table 1. We found that $k$-support regularization gave improved prediction
accuracy over both other methods.$^6$

$^5$ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

$^6$ Regarding other sparse prediction methods, we did not manage to compare with OSCAR,
due to memory limitations, or to PEN or trace Lasso, which do not have code available online.

6 Summary

We introduced the $k$-support norm as the tightest convex relaxation of sparsity
plus $\ell_2$ regularization, and showed that it is tighter than the elastic net by
exactly a factor of $\sqrt{2}$. In our view, this sheds light on the elastic net as a
close approximation to this tightest possible convex relaxation, and motivates
using the $k$-support norm when a tighter relaxation is sought. This is also
demonstrated in our empirical results.

We note that the $k$-support norm has better prediction properties, but not
necessarily better sparsity-inducing properties, as evident from its more rounded
unit ball. It is well understood that there is often a tradeoff between sparsity
and good prediction, and that even if the population optimal predictor is sparse,
a denser predictor often yields better predictive performance 131 [TO]. For
example, in the presence of correlated features, it is often beneficial to include
several highly correlated features rather than a single representative feature.
This is exactly the behavior encouraged by $\ell_2$ norm regularization, and the
elastic net is already known to yield less sparse (but more predictive) solutions.
The $k$-support norm goes a step further in this direction, often yielding solutions
that are even less sparse (but more predictive) compared to the elastic net.

Nevertheless, it is interesting to consider whether compressed sensing results,
where $\ell_1$ regularization is of course central, can be refined by using the $k$-support
norm, which might be able to handle more correlation structure within the set
of features.